---
title: "Regression Project – Group 3 (Intermediate Report)"
author: "Reema and Muzammil"
execute:
  echo: true
embed-resources: true
toc: true
format:
  html:
    code-link: true
    code-tools: true
    toc-location: right
    df-print: paged
  pdf:
    number-sections: true
  revealjs:
    output-ext: "revealjs-PRESENTATION.html"
    toc: false
    code-line-numbers: false
    echo: true
    scrollable: true
    code-link: true
    code-tools: true
    df-print: paged
    slide-number: true
---
# 1. Introduction
- This intermediate report explores determinants of employment income in South Austria (Styria and Carinthia).
- The response variable is net employment income (`py010n`).
- Explanatory variables are gender, citizenship, household size, and age.
- The report concentrates on data management and descriptive statistics; the regression model fitted in Section 5 is only a preliminary check ahead of the full modelling.
------------------------------------------------------------------------
# 2. Data collection and description
- Source: `eusilcP` from the `simFrame` package, a synthetic data set generated from Austrian EU-SILC (European Union Statistics on Income and Living Conditions) data.
- Type of data: person-level data on members of private households, modelled on the EU-SILC survey.
- Data format: cross-sectional microdata with social, demographic, and income information.
- Variables used:
  - `py010n`: net employment income (numeric)
  - `age`: age in years (numeric)
  - `hsize`: household size (converted to numeric)
  - `gender`: male / female (categorical)
  - `citizenship`: grouped nationality categories (categorical)
  - `region`: Austrian federal state (categorical)
- Data cleaning and missing values:
  - Keep only positive values of income (`py010n > 0`)
  - Convert `hsize` to numeric
  - Remove observations with missing values
- Subsetting:
  - Only individuals living in **Styria** and **Carinthia**, which together form the NUTS-1 region South Austria
------------------------------------------------------------------------
# 3. Load packages
```{r}
options(repos = c(CRAN = "https://cloud.r-project.org"))
# Install only packages that are missing, so rendering does not reinstall them every time
pkgs <- c("simFrame", "dplyr", "ggplot2", "tidyr", "forcats", "effects")
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
library(simFrame) # provides the eusilcP data set
library(dplyr)    # data manipulation
library(ggplot2)  # visualization
library(tidyr)    # data tidying
library(forcats)  # factor management
library(effects)  # effect plots
```
------------------------------------------------------------------------
# 4. Load data and select variables
```{r}
data(eusilcP)    # load the data set shipped with simFrame
str(eusilcP)     # variable types
head(eusilcP)    # first rows
summary(eusilcP) # summary of every variable
```
------------------------------------------------------------------------
# 5. Data preparation
- Filter for South Austria regions (Styria and Carinthia)
- Keep only positive income observations
- Convert household size to numeric
- Remove missing values
- Group citizenship categories if needed
```{r}
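# Data management: subset, clean, and set variable types
dat <- eusilcP %>%
  dplyr::select(py010n, gender, citizenship, hsize, age, region) %>%
  filter(region %in% c("Carinthia", "Styria")) %>% # South Austria
  filter(py010n > 0) %>%                           # positive income only
  na.omit() %>%
  droplevels()                                     # drop unused factor levels
dat$gender <- as.factor(dat$gender)
dat$citizenship <- as.factor(dat$citizenship)
# as.character() first, so that a factor yields its labels rather than level codes
dat$hsize <- as.numeric(as.character(dat$hsize))

# Preliminary model with a gender x citizenship interaction
model_int <- lm(py010n ~ gender * citizenship + hsize + age, data = dat)
anova(model_int)
summary(model_int)
# Normal Q-Q plot of the residuals
qqnorm(residuals(model_int))
qqline(residuals(model_int))
plot(allEffects(model_int))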
```
------------------------------------------------------------------------
# 6. Descriptive statistics
## 6.1 Numeric summaries
```{r}
summary(dat$py010n)
summary(dat$age)
summary(dat$hsize)
```
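Overall summaries can hide differences between groups, so it may help to summarise income within each group as well. The block below is a minimal sketch on a small simulated data frame (`toy_inc`, with made-up values, not the EU-SILC data) that assumes the same column names as `dat`. Replacing `toy_inc` with `dat` should give the corresponding table for the analysis sample.

```r
# Simulated toy data with hypothetical values (NOT the EU-SILC sample)
set.seed(1)
toy_inc <- data.frame(
  py010n = round(rlnorm(200, meanlog = 9.5, sdlog = 0.6)),
  gender = factor(sample(c("male", "female"), 200, replace = TRUE))
)

# Split income by gender and compute n, median, mean and sd per group
group_summary <- do.call(
  rbind,
  lapply(split(toy_inc$py010n, toy_inc$gender), function(x) {
    c(n = length(x), median = median(x), mean = mean(x), sd = sd(x))
  })
)
group_summary
```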
## 6.2 Frequency tables
```{r}
table(dat$gender)
table(dat$citizenship)
```
------------------------------------------------------------------------
# 7. Univariate visualizations
## 7.1 Employment income (py010n)
```{r}
ggplot(dat, aes(x = py010n)) +
  geom_histogram(bins = 50) +
  labs(title = "Histogram of employment income (py010n)")
```
```{r}
ggplot(dat, aes(y = py010n)) +
  geom_boxplot() +
  labs(title = "Boxplot of employment income (py010n)")
```
## 7.2 Age
```{r}
ggplot(dat, aes(x = age)) + geom_histogram(bins = 30)
```
```{r}
ggplot(dat, aes(y = age)) + geom_boxplot()
```
## 7.3 Household size (hsize)
```{r}
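# as.character() first, so that a factor yields its labels rather than level codes
ggplot(dat, aes(x = as.numeric(as.character(hsize)))) +
  geom_histogram(binwidth = 1) +
  labs(title = "Histogram of Household Size", x = "Household size (persons)")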
```
```{r}
# A boxplot needs a numeric y, so convert hsize explicitly
ggplot(dat, aes(y = as.numeric(as.character(hsize)))) + geom_boxplot()
```
## 7.4 Gender
```{r}
ggplot(dat, aes(x = gender)) + geom_bar()
```
## 7.5 Citizenship
```{r}
ggplot(dat, aes(x = citizenship)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Frequency of citizenship")
```
------------------------------------------------------------------------
# 8. Bivariate plots (predictors vs response)
## 8.1 Gender vs income
```{r}
ggplot(dat, aes(x = gender, y = py010n)) +
  geom_boxplot() +
  labs(title = "Income vs gender")
```
## 8.2 Citizenship vs income
```{r}
ggplot(dat, aes(x = citizenship, y = py010n)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Income vs citizenship")
```
## 8.3 Age vs income
```{r}
ggplot(dat, aes(x = age, y = py010n)) +
  geom_point(alpha = 0.3) +
  geom_smooth() +
  labs(title = "Income vs age")
```
## 8.4 Household size vs income
```{r}
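# hsize takes only a few distinct values, so one boxplot per household size
# is used instead of a scatter plot with a smoother
ggplot(dat, aes(x = factor(hsize), y = py010n)) +
  geom_boxplot() +
  labs(title = "Income vs household size", x = "Household size (persons)")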
```
------------------------------------------------------------------------
# 9. Interaction plots
## 9.1 Gender × Citizenship
```{r}
ggplot(dat, aes(x = gender, y = py010n, fill = citizenship)) +
  geom_boxplot(position = "dodge") +
  labs(title = "Income interaction: gender × citizenship")
```
## 9.2 Age × Gender
```{r}
ggplot(dat, aes(x = age, y = py010n, color = gender)) +
  geom_point(alpha = 0.3) +
  geom_smooth() +
  labs(title = "Income interaction: age × gender")
```
## 9.3 Age × Citizenship
```{r}
ggplot(dat, aes(x = age, y = py010n, color = citizenship)) +
  geom_point(alpha = 0.3) +
  geom_smooth() +
  labs(title = "Income interaction: age × citizenship")
```
## 9.4 Household Size × Gender
```{r}
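# hsize is discrete: jitter the points and fit one straight line per group
ggplot(dat, aes(x = as.numeric(as.character(hsize)), y = py010n, color = gender)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.3) +
  geom_smooth(method = "lm") +
  labs(title = "Income interaction: hsize × gender", x = "Household size (persons)")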
```
## 9.5 Household Size × Citizenship
```{r}
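# hsize is discrete: jitter the points and fit one straight line per group
ggplot(dat, aes(x = as.numeric(as.character(hsize)), y = py010n, color = citizenship)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.3) +
  geom_smooth(method = "lm") +
  labs(title = "Income interaction: hsize × citizenship", x = "Household size (persons)")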
```
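The interaction plots in this section can be complemented by a table of cell medians and a base R `interaction.plot()`. The following is a minimal sketch on simulated toy data (`toy_int`, hypothetical values, not the EU-SILC sample); the citizenship labels `AT`, `EU` and `Other` are assumed for illustration.

```r
# Simulated toy data with hypothetical values (NOT the EU-SILC sample)
set.seed(4)
toy_int <- data.frame(
  gender = factor(sample(c("male", "female"), 400, replace = TRUE)),
  citizenship = factor(sample(c("AT", "EU", "Other"), 400, replace = TRUE)),
  py010n = rlnorm(400, meanlog = 9.5, sdlog = 0.6)
)

# Median income in each gender x citizenship cell
cell_medians <- with(toy_int, tapply(py010n, list(gender, citizenship), median))
round(cell_medians)

# One line per gender across citizenship groups;
# lines that are far from parallel would hint at an interaction
with(toy_int, interaction.plot(citizenship, gender, py010n,
                               fun = median,
                               ylab = "Median income (toy data)"))
```

Because the toy values are drawn independently of gender and citizenship, any non-parallel lines here reflect sampling noise only.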
------------------------------------------------------------------------
# 10. Contingency tables (categorical × categorical)
```{r}
table(dat$gender, dat$citizenship)
```
- If categories are too detailed, merge small ones:
```{r}
# Keep the 3 most frequent categories; rarer ones are pooled into "Other"
dat$citizenship <- fct_lump_n(dat$citizenship, n = 3)
table(dat$gender, dat$citizenship)
```
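Raw counts are hard to compare when the groups differ in size, so row proportions may be easier to read. A minimal sketch on simulated toy data (`toy_tab`, hypothetical values, not the EU-SILC sample), with assumed citizenship labels `AT`, `EU` and `Other`:

```r
# Simulated toy data with hypothetical values (NOT the EU-SILC sample)
set.seed(2)
toy_tab <- data.frame(
  gender = factor(sample(c("male", "female"), 300, replace = TRUE)),
  citizenship = factor(sample(c("AT", "EU", "Other"), 300,
                              replace = TRUE, prob = c(0.85, 0.05, 0.10)))
)

counts <- table(toy_tab$gender, toy_tab$citizenship)

# Share of each citizenship group within each gender (each row sums to 1)
row_props <- prop.table(counts, margin = 1)
round(row_props, 3)
```

Using `margin = 2` instead would give the gender split within each citizenship group.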
- Extra: income vs age with 2D density contours
```{r}
ggplot(dat, aes(x = age, y = py010n)) +
  geom_point(alpha = 0.3) +
  geom_density_2d(color = "blue") +
  labs(title = "Income vs Age with 2D Density Contours", x = "Age", y = "Net Employment Income (py010n)")
```
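
The gender × citizenship boxplots (Section 9.1) compare income distributions across citizenship groups, separately for men and women, and highlight a potential interaction between gender and citizenship.

------------------------------------------------------------------------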
# 11. Summary
- The income variable is strongly right-skewed, with a number of high outliers.
- Men typically have a higher median employment income than women.
- Income differences between citizenship groups may indicate structural wage inequality.
- Income tends to increase with age, but the relationship is nonlinear.
- Household size shows no clear association with income.
- The plots suggest possible interaction effects, for example gender × citizenship.
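
Given the right skew noted above, a log transformation of income is one common option to consider before modelling. The sketch below uses simulated log-normal values (hypothetical, not the EU-SILC data) and a hand-written helper, `skewness_moment()`, which is an illustrative function and not part of any package.

```r
# Simulated right-skewed toy incomes (NOT the EU-SILC sample)
set.seed(3)
toy_income <- rlnorm(1000, meanlog = 9.5, sdlog = 0.7)

# Moment-based skewness: third central moment divided by
# the second central moment raised to the power 1.5
skewness_moment <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_raw <- skewness_moment(toy_income)      # positive for right-skewed data
skew_log <- skewness_moment(log(toy_income)) # near 0 if the log scale is roughly symmetric
c(raw = skew_raw, log = skew_log)
```

Whether the log scale is appropriate for `py010n` would need to be checked on `dat` itself, for example with a histogram of `log(py010n)`.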